library("install.load")
# ggplot2 and readr load with the tidyverse, so they don't need separate entries
install_load("tidyverse", "plotly", "olsrr")
taxlotdata <- read_csv("~/Desktop/RFiles/689 Assigments/taxlot_sample.csv")
taxlot_condo <- taxlotdata %>%
filter(condo == 1)
dependent variable: TOTALVAL
independent variable: BLDGSQFT
g <- taxlot_condo %>%
ggplot(aes(BLDGSQFT, TOTALVAL, text = FID)) +
geom_point()
ggplotly(g, tooltip = "text")  # the "text" aesthetic carries FID
FIDs of the outliers: 2613, 2647, 1110, 2128
All four of these outliers have massive building footprints, over 40,000 ft², much larger than any other building in the dataset. They don't appear to belong with the rest of the data and could skew the results if included.
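One way to exclude these points is to drop the flagged FIDs directly (a sketch; `taxlot_trim` is a name I'm introducing here, and the analysis below instead filters on a square-footage threshold):

```r
# Sketch: drop the four flagged outliers by FID.
# Assumes taxlot_condo and the FID column from above; taxlot_trim is hypothetical.
outlier_fids <- c(2613, 2647, 1110, 2128)
taxlot_trim <- taxlot_condo %>%
  filter(!FID %in% outlier_fids)
```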
newdata <- taxlot_condo %>%
  filter(BLDGSQFT <= 40000,
         TOTALVAL > 0)
ggplot(newdata, aes(BLDGSQFT, TOTALVAL)) +
geom_point()
log-lin - this seems to be a good fit
log-log - this looks like a better fit
lin-log - this looks like a quadratic fit
To run an OLS regression we need a linear relationship between the variables. I can't tell visually whether the log-lin or the log-log form is the better fit, but it's clear that the TOTALVAL variable needs a log transformation. I'm unsure whether transforming building square footage with a log as well would yield a better model; further testing is needed to decide.
newdata %>%
ggplot(aes(log(TOTALVAL), BLDGSQFT)) +
geom_point()
newdata %>%
ggplot(aes(log(TOTALVAL), log(BLDGSQFT))) +
geom_point()
newdata %>%
ggplot(aes(TOTALVAL, log(BLDGSQFT))) +
geom_point()
model a (TOTALVAL ~ BLDGSQFT): Multiple R-squared: 0.5369, Adjusted R-squared: 0.5364
model b (log(TOTALVAL) ~ BLDGSQFT): Multiple R-squared: 0.3927, Adjusted R-squared: 0.3921
model c (log(TOTALVAL) ~ log(BLDGSQFT)): Multiple R-squared: 0.4095, Adjusted R-squared: 0.4089
model d (TOTALVAL ~ log(BLDGSQFT)): Multiple R-squared: 0.4108, Adjusted R-squared: 0.4102
To determine the best model we look for the highest R-squared. Among these models, model a has the highest value, meaning it fits best by that measure. This is a bit confusing because the earlier visual testing suggested transforming total value with a log. One caveat: R-squared is only directly comparable between models with the same response, and models b and c explain variance in log(TOTALVAL) rather than TOTALVAL, so model a's higher R-squared doesn't settle the question. Some outliers may also still be pulling on the fitted lines. I'm not sure which is the best model so far; more testing needs to be completed!
lma <- lm(TOTALVAL ~ BLDGSQFT, newdata) %>%
summary()
lmb <- lm(log(TOTALVAL) ~ BLDGSQFT, newdata) %>%
summary()
lmc <- lm(log(TOTALVAL) ~ log(BLDGSQFT), newdata) %>%
summary()
lmd <- lm(TOTALVAL ~ log(BLDGSQFT), newdata) %>%
summary()
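The four fits above can also be lined up in one pass (a sketch; assumes `newdata` from earlier, and remember that the log-response values are on a different scale from the others):

```r
# Sketch: refit models a-d and collect adjusted R-squared in one pass.
forms <- list(
  a = TOTALVAL ~ BLDGSQFT,
  b = log(TOTALVAL) ~ BLDGSQFT,
  c = log(TOTALVAL) ~ log(BLDGSQFT),
  d = TOTALVAL ~ log(BLDGSQFT)
)
sapply(forms, function(f) summary(lm(f, data = newdata))$adj.r.squared)
```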
The Residuals vs Fitted plot shows the data are not linearly related; the residual errors increase dramatically as X increases.
The Q-Q plot shows the residual errors are approximately normally distributed. The line sits at about a 45 degree angle, with some departure at the beginning and end, which is expected.
These data are heteroskedastic according to the scale-location plot: the fit line is not horizontal and trends significantly upward as X increases.
Some points seem to exert a large influence on the fit because they fall outside the 0.5 line for Cook's distance. This is in line with the previous tests, which show a large cluster of data points with a few pulling the fitted line significantly steeper. This plot helps us determine which points are throwing off the fit line.
lm_resfit <- lm(TOTALVAL ~ BLDGSQFT, newdata)
plot(lm_resfit, ask=FALSE)
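To see exactly which observations the Cook's distance plot is flagging, the distances can be listed directly (a sketch; `4/n` is one common cutoff, not the only choice):

```r
# Sketch: list high-influence observations for lm_resfit by Cook's distance.
# Assumes lm_resfit and newdata from above.
cd <- cooks.distance(lm_resfit)
influential <- which(cd > 4 / nrow(newdata))
newdata[influential, c("FID", "BLDGSQFT", "TOTALVAL")]
```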
not a good fit
also not a good fit
better…?
not the best
Based on these scatterplots I think c, or log-log, is the best fit, but it’s really hard to tell.
newdata %>%
ggplot(aes(TOTALVAL, dpioneer)) +
geom_point()
newdata %>%
ggplot(aes(log(TOTALVAL), dpioneer)) +
geom_point()
newdata %>%
ggplot(aes(log(TOTALVAL), log(dpioneer))) +
geom_point()
newdata %>%
ggplot(aes(TOTALVAL, log(dpioneer))) +
geom_point()
model a1 (TOTALVAL ~ dpioneer): Multiple R-squared: 0.1276, Adjusted R-squared: 0.1267
model b1 (log(TOTALVAL) ~ dpioneer): Multiple R-squared: 0.2867, Adjusted R-squared: 0.286
model c1 (log(TOTALVAL) ~ log(dpioneer)): Multiple R-squared: 0.25, Adjusted R-squared: 0.2492
model d1 (TOTALVAL ~ log(dpioneer)): Multiple R-squared: 0.1263, Adjusted R-squared: 0.1254
Based on the R-squared statistics, model b1 (log-lin) is the best model.
lma1 <- lm(TOTALVAL ~ dpioneer, newdata) %>%
summary()
lmb1 <- lm(log(TOTALVAL) ~ dpioneer, newdata) %>%
summary()
lmc1 <- lm(log(TOTALVAL) ~ log(dpioneer), newdata) %>%
summary()
lmd1 <- lm(TOTALVAL ~ log(dpioneer), newdata) %>%
summary()
not a good fit; one or more of the variables needs a log transformation
this could be an okay fit
better…? The data are grouped in a cloud in the center of the plot
fairly linear, but difficult to tell
Based on these scatterplots I think c, or log-log, is the best fit, but it’s really hard to tell.
newdata %>%
ggplot(aes(TOTALVAL, dmax)) +
geom_point()
newdata %>%
ggplot(aes(log(TOTALVAL), dmax)) +
geom_point()
newdata %>%
ggplot(aes(log(TOTALVAL), log(dmax))) +
geom_point()
newdata %>%
ggplot(aes(TOTALVAL, log(dmax))) +
geom_point()
model a2 (TOTALVAL ~ dmax): Multiple R-squared: 0.05245, Adjusted R-squared: 0.05148
model b2 (log(TOTALVAL) ~ dmax): Multiple R-squared: 0.09992, Adjusted R-squared: 0.099
model c2 (log(TOTALVAL) ~ log(dmax)): Multiple R-squared: 0.07203, Adjusted R-squared: 0.07109
model d2 (TOTALVAL ~ log(dmax)): Multiple R-squared: 0.03544, Adjusted R-squared: 0.03446
model e2 (TOTALVAL ~ dmax + dmax²): Multiple R-squared: 0.05565, Adjusted R-squared: 0.05372
Based on the R-squared statistics, model b2 (log-lin) is the best model.
lma2 <- lm(TOTALVAL ~ dmax, newdata) %>%
summary()
lmb2 <- lm(log(TOTALVAL) ~ dmax, newdata) %>%
summary()
lmc2 <- lm(log(TOTALVAL) ~ log(dmax), newdata) %>%
summary()
lmd2 <- lm(TOTALVAL ~ log(dmax), newdata) %>%
summary()
lme2 <- lm(TOTALVAL ~ dmax+I(dmax^2), newdata) %>%
summary()
After testing the pairwise models it's still unclear whether the best response is TOTALVAL itself or log(TOTALVAL), so a multiple regression is useful. After running both regressions, log(TOTALVAL) appears to be the better fit: the TOTALVAL model gives Multiple R-squared 0.7036, while the log(TOTALVAL) model gives 0.7284. Since 72.8% > 70.4%, the log(TOTALVAL) model is preferred (keeping in mind that R-squared values for differently transformed responses aren't strictly comparable).
All VIFs are below 2, so multicollinearity doesn't seem to be present.
In the Residuals vs Fitted plot the line is roughly horizontal, meaning the residuals average to about zero across the range of fitted values, consistent with a linear relationship.
According to the Q-Q plot, the residuals are fairly normal: the points basically fall along a 45 degree line, with some departure at the tails. (The formal normality tests below all reject normality, though, so those departures do matter.)
The Breusch-Pagan test checks the null hypothesis that the residual variance is constant. Here Chi2 = 994.13 with p ≈ 3.4e-218, far below 0.05, so we reject the null: the residuals are heteroskedastic. The coefficient estimates are still unbiased, but the usual standard errors can't be trusted, which is why heteroskedasticity-consistent (HC3) standard errors are computed below.
lm(TOTALVAL ~ BLDGSQFT + dpioneer + dmax, newdata) %>%
summary
##
## Call:
## lm(formula = TOTALVAL ~ BLDGSQFT + dpioneer + dmax, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1116139 -62708 -10343 42980 1629457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69922.75 11546.19 6.056 1.99e-09 ***
## BLDGSQFT 340.62 7.82 43.558 < 2e-16 ***
## dpioneer -47182.66 2520.30 -18.721 < 2e-16 ***
## dmax -6382.66 7065.21 -0.903 0.367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154200 on 978 degrees of freedom
## Multiple R-squared: 0.7036, Adjusted R-squared: 0.7027
## F-statistic: 773.7 on 3 and 978 DF, p-value: < 2.2e-16
lm(log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, newdata) %>%
summary
##
## Call:
## lm(formula = log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, data = newdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7015 -0.1706 0.0204 0.1964 0.8372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.214e+01 2.461e-02 493.131 <2e-16 ***
## BLDGSQFT 6.648e-04 1.667e-05 39.880 <2e-16 ***
## dpioneer -1.534e-01 5.373e-03 -28.559 <2e-16 ***
## dmax 3.764e-04 1.506e-02 0.025 0.98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3288 on 978 degrees of freedom
## Multiple R-squared: 0.7284, Adjusted R-squared: 0.7276
## F-statistic: 874.5 on 3 and 978 DF, p-value: < 2.2e-16
lm_TFPD <- lm(log(TOTALVAL) ~ BLDGSQFT + dpioneer + dmax, newdata)
ols_vif_tol(lm_TFPD)
## # A tibble: 3 x 3
## Variables Tolerance VIF
## <chr> <dbl> <dbl>
## 1 BLDGSQFT 0.995 1.01
## 2 dpioneer 0.671 1.49
## 3 dmax 0.673 1.48
plot(lm_TFPD, ask=FALSE)
ols_test_normality(lm_TFPD)
## -----------------------------------------------
## Test Statistic pvalue
## -----------------------------------------------
## Shapiro-Wilk 0.8904 0.0000
## Kolmogorov-Smirnov 0.0686 2e-04
## Cramer-von Mises 173.7969 0.0270
## Anderson-Darling 8.0214 0.0000
## -----------------------------------------------
ols_test_breusch_pagan(lm_TFPD)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## -----------------------------------------
## Response : log(TOTALVAL)
## Variables: fitted values of log(TOTALVAL)
##
## Test Summary
## --------------------------------
## DF = 1
## Chi2 = 994.1349
## Prob > Chi2 = 3.381657e-218
vcov0 <- vcov(lm_TFPD)                   # conventional covariance matrix, for comparison
require(car)                             # provides hccm()
vcov_hc3 <- hccm(lm_TFPD, type = "hc3")  # HC3 robust covariance matrix
if (!require(lmtest)) {
  install.packages("lmtest")
  library(lmtest)
}
coeftest(lm_TFPD, vcov_hc3)              # re-test coefficients with robust standard errors
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.13818343 0.10849546 111.8773 < 2.2e-16 ***
## BLDGSQFT 0.00066482 0.00010277 6.4688 1.558e-10 ***
## dpioneer -0.15344513 0.00657837 -23.3257 < 2.2e-16 ***
## dmax 0.00037642 0.01361227 0.0277 0.9779
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t-tests and significance of each coefficient: with HC3 robust standard errors, BLDGSQFT (t = 6.47, p ≈ 1.6e-10) and dpioneer (t = -23.33, p < 2.2e-16) remain highly significant, while dmax (t = 0.03, p = 0.98) is not significantly different from zero.
How each coefficient relates to TOTALVAL: because the response is log(TOTALVAL), each additional square foot of building area is associated with roughly a 0.07% increase in total value, each additional unit of dpioneer (in whatever distance unit the dataset uses) with roughly a 14% decrease, and dmax with no measurable change.
Goodness of fit: Multiple R-squared is 0.7284 (adjusted 0.7276), so the model explains about 73% of the variation in log(TOTALVAL).
Limitations of the model: the Breusch-Pagan test and all four normality tests show heteroskedastic, non-normal residuals, so robust standard errors are required; four extreme observations were removed up front; dmax adds essentially nothing; and with only three predictors the model likely omits other drivers of value (building age, condition, location factors beyond the two distances).
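Since the response is log(TOTALVAL), the coefficients convert to approximate percent effects on TOTALVAL via exp(b) - 1; a sketch using the robust estimates printed by coeftest() above:

```r
# Sketch: convert log-lin coefficients to percent effects on TOTALVAL.
# Estimates copied from the coeftest() output above.
est <- c(BLDGSQFT = 0.00066482, dpioneer = -0.15344513, dmax = 0.00037642)
pct <- (exp(est) - 1) * 100  # % change in TOTALVAL per one-unit increase
round(pct, 3)
```

So each extra square foot raises expected value by about 0.07%, and each additional unit of dpioneer lowers it by about 14.2%.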